Imagine you wanted to scrape researchgate.net, since it contains self-created profiles of many researchers. However, when you try to get the HTML content, the request fails with an HTTP 403 error.
If you don’t know what an HTTP error means, you can go to https://http.cat and have the status explained in a fun way. Below I use a little convenience function:
error_cat <- function(error) {
  link <- paste0("https://http.cat/images/", error, ".jpg")
  knitr::include_graphics(link)
}
error_cat(403)
So what’s going on?
If something like this happens, the server essentially did not fulfill our request.
This is because the website seems to have some special requirements for serving the (correct) content. These could be:
specific user agents
other specific headers
login through browser cookies
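As a minimal sketch of how such requirements can be met from R, the request below sets a browser-like user agent, two extra headers, and a cookie with httr2. The profile URL, the header values, and the cookie are hypothetical placeholders. Whether this is enough depends on the website.

```r
library(httr2)

# Hypothetical profile URL and header values, for illustration only;
# in practice, use the values your own browser sends
req <- request("https://www.researchgate.net/profile/EXAMPLE") |>
  req_user_agent("Mozilla/5.0 (X11; Linux x86_64; rv:120.0) Gecko/20100101 Firefox/120.0") |>
  req_headers(
    Accept = "text/html,application/xhtml+xml",
    `Accept-Language` = "en-US,en;q=0.5",
    Cookie = "sid=PLACEHOLDER"
  )

# Printing shows the request that would be sent; nothing is sent yet
print(req)

# To actually send the request and check the status:
# resp <- req_perform(req)
# resp_status(resp)
```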
To find out how the browser manages to get the correct response, we can use the Network tab in the browser's inspection tool.
Strategy 1: Emulate what the Browser is Doing
Open the Inspect Window Again:
But this time, we focus on the Network tab:
Here we get an overview of all the network activity of the browser and the individual requests for data that are performed. Clear the network log first and reload the page to see what is going on. Finding the right call is not always easy, but in most cases, we want:
a call with status 200 (OK/successful)
a document type
something that is at least a few kB in size
Initiator is usually “other” (we initiated the call by refreshing)
Once you have identified the call, right-click it and choose Copy -> Copy as cURL.
More on cURL Calls
What is cURL:
cURL is a command-line tool and library (libcurl) for making HTTP requests.
it is widely used for API calls from the terminal.
it lists the parameters of a call in a pretty readable manner:
the unnamed argument in the beginning is the Uniform Resource Locator (URL) the request goes to
-H arguments describe the headers, which are arguments sent with the call
-d is the data or body of a request, which is used e.g., for uploading things
-o/-O can be used to write the response to a file (otherwise the response is returned to the screen)
--compressed asks for a compressed response, which is unpacked locally (saves bandwidth)
We have seen httr2::curl_translate() in action yesterday.
It can also convert more complicated calls, which makes R look no different from a regular browser.
(Remember: you need to escape every " inside the call. Press Ctrl + F to open the Find & Replace tool, put " in the Find field and \" in the Replace field, then go through all matches except the first and last, which delimit the string.)
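As an alternative to escaping by hand, R 4.0 and later supports raw strings, where the copied command can be pasted unchanged. The sketch below uses a made-up command (the URL and header are assumptions, not a real copied call) and shows that both spellings give the same string before passing it to httr2::curl_translate().

```r
# Hypothetical command, as a browser might copy it (URL and header are made up)
# Option 1: a regular string, with every inner " escaped as \"
cmd_escaped <- "curl \"https://example.com/profile\" -H \"accept: text/html\" --compressed"

# Option 2: a raw string (R >= 4.0), where the command is pasted unchanged
cmd_raw <- r"(curl "https://example.com/profile" -H "accept: text/html" --compressed)"

# Both spellings produce exactly the same character string
identical(cmd_escaped, cmd_raw)

# curl_translate() turns the command into httr2 code; it is wrapped in try()
# so the script keeps running if httr2 or an optional dependency is missing
translated <- try(httr2::curl_translate(cmd_raw), silent = TRUE)
print(translated)
```

If the copied command itself contains the sequence )" the raw string needs a different delimiter, for example r"---( ... )---".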
Essentially, someone pressed a relational database into a list format, and we now have to scramble to cope with this monstrosity.
Parsing the JSON
I have not come up with a better method so far. The only way I found to extract the data is a nested for loop that goes through all days and all entries in the object, looking for elements called “sessions”.
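The loop itself is not shown here, so the following is only a sketch of the idea on a small toy list. The structure and all field names (date, entries, id, name, start, description) are assumptions made for illustration and will differ from the real object.

```r
# Toy stand-in for the parsed JSON; structure and field names are hypothetical
toy_schedule <- list(
  list(
    date = "2023-05-25",
    entries = list(
      list(label = "Registration"),
      list(sessions = list(
        list(id = 1L, name = "Panel A", start = "2023-05-25 09:00",
             description = "<b>Papers: </b>"),
        list(id = 2L, name = "Panel B", start = "2023-05-25 10:30",
             description = NULL)
      ))
    )
  ),
  list(
    date = "2023-05-26",
    entries = list(
      list(sessions = list(
        list(id = 3L, name = "Panel C", start = "2023-05-26 09:00",
             description = "Some text")
      ))
    )
  )
)

# Replace missing (NULL) fields with NA so every row has the same columns
or_na <- function(x) if (is.null(x)) NA_character_ else x

rows <- list()
for (day in toy_schedule) {
  for (entry in day$entries) {
    # skip entries that do not contain an element called "sessions"
    if (!"sessions" %in% names(entry)) next
    for (s in entry$sessions) {
      rows[[length(rows) + 1]] <- data.frame(
        panel_id = s$id,
        panel_name = s$name,
        time = s$start,
        desc = or_na(s$description),
        stringsAsFactors = FALSE
      )
    }
  }
}

# Stack the one-row data frames into a single table
panels <- do.call(rbind, rows)
panels
```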
# A tibble: 881 × 4
   panel_id panel_name                                               time  desc
      <int> <chr>                                                    <chr> <chr>
 1  3113155 PRECONFERENCE: Games and the (Playful) Future of Commun… 2023… "Rec…
 2  3113156 PRECONFERENCE: Generation Z and Global Communication     2023… "Gen…
 3  3113166 PRECONFERENCE: Nothing About Us, Without Us: Authentic … 2023… "Thi…
 4  3113172 PRECONFERENCE: Reimagining the Field of Media, War and … 2023… "As …
 5  3113175 PRECONFERENCE: The Legacies of Elihu Katz                2023… "Eli…
 6  3112705 Human-Machine Preconference Breakout (room 2)            2023… <NA>
 7  3113080 New Avoidance Preconference Breakout (room 2)            2023… <NA>
 8  3113150 PRECONFERENCE: 12th Annual Doctoral Consortium of the C… 2023… "The…
 9  3113154 PRECONFERENCE: Ethics of Critically Interrogating and R… 2023… "The…
10  3113158 PRECONFERENCE: Human-Machine Communication: Authenticit… 2023… "The…
# ℹ 871 more rows
Extracting paper title and authors
Finally, we want to parse the HTML in the description column.
ica_data_df$desc[100]
3113023
"<br /><br /><b>Participants: </b><br /><b><i>(Chairs) </i></b>Wayne Xu, U of Massachusetts Amherst<br /><br /><b>Papers: </b><br />Disentangling the Longitudinal Relationship Between Social Media Use, Political Expression and Political Participation: What Do We Really Know?<br /><i>Jörg Matthes, U of Vienna</i><br /><i>Andreas Nanz, U of Vienna</i><br /><i>Marlis Stubenvoll, U of Vienna</i><br /><i>Ruta Kaskeleviciute, U of Vienna</i><br /><br />Political Discussions on Russian YouTube: How Did They Change Since the Start of the War in Ukraine?<br /><i>Ekaterina Romanova, U of Florida</i><br /><br />Perceptions of and Reactions to Different Types of Incivility in Public Online Discussions: Results of an Online Experiment<br /><i>Marike Bormann, Unviersity of Düsseldorf</i><br /><i>Dominique Heinbach, Heinrich-Heine-U</i><br /><i>Jan Kluck, U of Duisburg-Essen</i><br /><i>Marc Ziegele, Heinrich Heine U</i><br /><br />When Trust in AI Mediates: AI News Use, Public Discussion, and Civic Participation<br /><i>Seungahn Nah, U of Florida</i><br /><i>Chun Shao, Arizona State U</i><br /><i>Ekaterina Romanova, U of Florida</i><br /><i>Gwiwon Nam, U of Florida</i><br /><i>Fanjue Liu, U of Florida</i> <a href='https://ica2023.cadmore.media/object/451094' style='text-decoration: none; background-color: #789F90; color: #FFFFFF; padding: 5px 10px; border: 1px solid #789F90; border-radius: 15px;'>Open Session</a><br /><br />"
We can inspect HTML content by writing it to a temporary file and opening it in the browser. Below is a function that does this automatically for you:
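The function itself is not reproduced in this section. A minimal sketch of what check_in_browser() might look like, using only base R, is given below. The implementation is an assumption, and the HTML fragment is a made-up example.

```r
# Hypothetical reimplementation of check_in_browser(); the original is not shown
check_in_browser <- function(html) {
  # write the HTML fragment to a temporary .html file
  path <- tempfile(fileext = ".html")
  writeLines(html, path)
  # only try to open a browser in an interactive session
  if (interactive()) {
    utils::browseURL(path)
  }
  # return the file path invisibly so it can be inspected or reused
  invisible(path)
}

# Made-up fragment in the style of the description column
html_fragment <- "<b>Papers: </b><br />A Paper Title<br /><i>An Author, Some U</i>"
path <- check_in_browser(html_fragment)
readLines(path)
```

Returning the path instead of nothing makes the function easy to check in scripts, where no browser window is opened.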
Extracting paper title and authors using a function
I wrote another function for this. You can check some of the panels using the browser: check_in_browser(ica_data_df$desc[100]).
# packages used by pull_papers()
library(rvest)
library(stringr)
library(purrr)

pull_papers <- function(desc) {
  # we extract the html code starting with the papers line
  papers <- str_extract(desc, "<b>Papers: </b>.+$") |>
    str_remove("<b>Papers: </b><br />") |>
    # we split the html by double line breaks, since it is not properly
    # formatted as paragraphs
    strsplit("<br /><br />", fixed = TRUE) |>
    pluck(1)

  # if there is no html code left, just return NAs
  if (all(is.na(papers))) {
    return(list(list(paper_title = NA, authors = NA)))
  } else {
    # otherwise we loop through each paper
    map(papers, function(t) {
      html <- read_html(t)

      # first line is the title
      title <- html |>
        html_text2() |>
        str_extract("^.+\n")

      # at least authors are formatted italic
      authors <- html_elements(html, "i") |>
        html_text2()

      list(paper_title = title, authors = authors)
    })
  }
}
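The step from the nested list returned by pull_papers() to a long table with one row per author is not shown here. One possible way to do it, sketched in base R on a toy list with made-up titles and authors (not necessarily how the table in this section was produced), is:

```r
# Toy stand-in for what pull_papers() returns for one panel:
# a list with one element per paper (titles and authors are made up)
papers <- list(
  list(paper_title = "First Paper", authors = c("Author A", "Author B")),
  list(paper_title = "Second Paper", authors = "Author C")
)

# data.frame() recycles the single title across all authors of a paper,
# and rbind() stacks the per-paper data frames into one long table
papers_long <- do.call(rbind, lapply(papers, function(p) {
  data.frame(
    paper_title = p$paper_title,
    authors = p$authors,
    stringsAsFactors = FALSE
  )
}))
papers_long
```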
# A tibble: 8,169 × 5
   panel_id panel_name                            time       paper_title authors
      <int> <chr>                                 <chr>      <chr>       <chr>
 1  3113249 The Powers of Platforms               2023-05-2… "Serve the… Changw…
 2  3113249 The Powers of Platforms               2023-05-2… "Serve the… Ziyi W…
 3  3113249 The Powers of Platforms               2023-05-2… "Serve the… Joel G…
 4  3113249 The Powers of Platforms               2023-05-2… "Empowered… Andrea…
 5  3113249 The Powers of Platforms               2023-05-2… "Empowered… Jacob …
 6  3113249 The Powers of Platforms               2023-05-2… "The Rise … Guy Ho…
 7  3113249 The Powers of Platforms               2023-05-2… "Google Ne… Lucia …
 8  3113249 The Powers of Platforms               2023-05-2… "Google Ne… Mathia…
 9  3113249 The Powers of Platforms               2023-05-2… "Google Ne… Amalia…
10  3112411 Affiliate Journals Top Papers Session 2023-05-2… "One Year … Eloria…
# ℹ 8,159 more rows